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Abstract 

Mean field variational Bayes (MFVB) is a popular posterior approximation method due to its 
fast runtime on large-scale data sets. However, it is well known that a major failing of MFVB 
is that it underestimates the uncertainty of model variables (sometimes severely) and provides 
no information about model variable covariance. We develop a fast, general methodology for 
exponential families that augments MFVB to deliver accurate uncertainty estimates for model 
variables—both for individual variables and coherently across variables. MFVB for exponential 
families defines a fixed-point equation in the means of the approximating posterior, and our ap¬ 
proach yields a covariance estimate by perturbing this fixed point. Inspired by linear response 
theory, we call our method linear response variational Bayes (LRVB). We also show how LRVB 
can be used to quickly calculate a measure of the influence of individual data points on parameter 
point estimates. We demonstrate the accuracy and scalability of our method by learning Gaussian 
mixture models for both simulated and real data. 


1 Introduction 

With increasingly efficient data collection methods, scientists are interested in quickly analyzing 
ever larger data sets. In particular, the promise of these large data sets is not simply to fit old models 
but instead to learn more nuanced patterns from data than has been possible in the past. In theory, the 
Bayesian paradigm promises exactly these desiderata. Hierarchical modeling allows practitioners to 
capture complex relationships between variables of interest. Moreover, Bayesian analysis allows 
practitioners to quantify the uncertainty in any model estimates—and to do so coherently across all 
of the model variables. 

Mean field variational Bayes (MFVB), a method for approximating a Bayesian posterior dis¬ 
tribution, has grown in popularity due to its fast runtime on large-scale data sets HHS- But it is 
well known that a major failing of MFVB is that it gives underestimates of the uncertainty of model 
variables that can be arbitrarily bad, even when approximating a simple multivariate Gaussian distri¬ 
bution JUS], and provides no information about how the uncertainties in different model variables 
interact EH!. We develop a fast, general methodology for exponential families that augments 
MFVB to deliver accurate uncertainty estimates for model variables—both for individual variables 
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and coherently across variables. In particular, as we elaborate in Section [2j MFVB for exponen¬ 
tial families defines a fixed-point equation in the means of the approximating posterior, and our 
approach yields a covariance estimate by perturbing this fixed point. The perturbations of linear 
response theory have previously been applied for machine learning by (9) and specifically for mean- 
field methods by ITOl and ITTI . Our contribution is to use exponential families to derive particularly 
simple and scalable formulas for covariance estimation and to develop a method to quickly calculate 
influence scores, which measure the influence of individual data points on parameter point estimates. 
We call our method linear response variational Bayes (LRVB). 

We demonstrate the accuracy and scalability of our LRVB covariance estimates with experiments 
that focus on finite mixtures of multivariate Gaussians, which have historically been a sticking point 
for MFVB covariance estimates BIS). We employ simulated data as well as the MNIST handwritten 
digit data set d). We show that the LRVB variance estimates are nearly identical to those produced 
by a Markov Chain Monte Carlo (MCMC) sampler, even when MFVB variance is dramatically un¬ 
derestimated. For these mixture models, we show that LRVB gives accurate covariance estimates 
orders of magnitude faster than MCMC on a wide range of problems. We demonstrate both theo¬ 
retically and empirically that, for this Gaussian mixture model, LRVB scales linearly in the number 
of data points and approximately quadratically in the dimension of the parameter space. Finally, we 
show how LRVB allows fast computation of the influence scores mentioned above. 


2 Mean-field variational Bayes in exponential families 


Denote our N observed data points by the V-long column vector x, and denote our unobserved 
model parameters by 0. Here, 9 is a column vector residing in some space 0; it has J subgroups and 
total dimension D. Our model is specified by a distribution of the observed data given the model 
parameters—the likelihood p(x\9 )—and a prior distributional belief on the model parameters p(0). 
Bayes’ Theorem yields the posterior p{9\x). 

MFVB approximates p{9\x) by a factorized distribution of the form q{9) = Y\]-\ suc h 
that the Kullback-Liebler divergence KL(q| \p) between q and p is minimized: 


q* := argminKL(g||p) 

9 


= arg min E 9 

9 


logp(0|x)- lo g«(%) 
j=i.J 


By the assumed q factorization, the solution to this minimization obeys the following fixed point 
equations Q: 

log < 7 * (9j) = E q *. ie[J] \j logp(0, x) + C. (1) 

Here, and for the rest of the text, C denotes a constant and [J] := {1,..., J}. For index j, 
suppose that p{6j \ 0 ie tj}\j, x) is in natural exponential family form: 

P{0j\ 0 ielJ]\j’ x ) = ex P (VjOj - (2) 

with local natural parameter fjj and local log partition function A :l . Here, fjj may be a function of 
0,; e [j]\j and x. If the exponential family assumption above holds for every index j, then we can 
write fjj as a sum of products of components of each 9k vector: 


fjj — G r 9kr k , 

r£Rj ke[j]\j 


(3) 
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where G r is a Dj-sized column vector and 9kr k is a scalar. Here, r is a vector of length J — 1. Each 
entry of r is either 0 or an index in [£)&]. If = 0, then 9k rk = 1; otherwise, 9k rk is the r^.th 
element of the vector 9k- This notation scheme guarantees that each product contains at most one 
factor from the vector 9k for each index k. In particular, the log likelihood is linear in every vector 
9j. This property of the log likelihood is guaranteed by Eq. Q. Appendix A.l contains further 
details and a proof of Eq. (3k. 

It follows from Eqs. (JTi, (|2j, (|3j, and the assumed factorization of q* that log q* (9j) has the form 


e g - n 9 >+ c - 


(4) 


ir-Gitj k£[J]\j 


We see that q* is in the same exponential family form as p{9 3 \9 l(z y^j, x). Let r/j denote the natural 
parameter of q*, and denote the mean parameter of q* as rrij := E y/ • 9 :l . We see from Eq. (|4j that 


Vj = E Gr II mkr *• 

rGRj ke[J]\j 


Since rrij is a function of r/j , we have the fixed point equations rrij := Mj(rrii e \j]\j ) for mappings 
Mj across j and 

m := M(rri) 

for the vector of mappings M. 


3 Linear response covariance estimation 

Let V denote the covariance matrix of 9 under the factorized variational distribution q*(9), and let 
E denote the covariance matrix of 9 under the true distribution, p(9\x): 

V:=Cov q ,9, E:=Co v p 9. 

V may be a poor estimate of E, even when rn w E p 0, i.e. when the marginal means match well 
Our goal is to use the MFVB solution and the techniques of linear response theory HHEQ to 
construct an improved estimate for E. 

Define p t (9\x) such that its log is a linear perturbation of the log posterior: 

log p t (9\x) = \ogp(0\x) + t T 9 - C(t), (5) 

where C(t) is a constant in 9. If we assume that p(9\x) is a probability distribution with natural 
parameters in the interior of the feasible space, then p t (9\x) is a probability distribution for any t 
in an open ball around 0. Since C{t) normalizes the p t (9\x) distribution, it is in fact the cumulant 
generating function of p(9\x). Further, every (perturbed) conditional distribution p t (9j |0j e [j]Yj, x) 
is in the same exponential family as every (unperturbed) conditional distribution p (9 3 \ 9 ie imj, x) by 
construction. So, for each t, we have mean field variational approximation <p with marginal means 
rn t j := E q *9j and fixed point equations m-tj = Mtj(m t ^[j]\j) across j. Thus, m t = M t (m t ) 
as in Section[2] Taking derivatives of the latter relationship with respect to t, we find 

dm t dM t dm t dM t 
dt T dmj dt T dt T 
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In particular, note that t is a vector of size D (the total dimension of 9), and 'jnr, e.g., is a matrix of 
size D x D with (a, &)th entry equal to the scalar dmt, a /dtb. 

Since ql is the MFVB approximation for the perturbed posterior p t (9\x), we may hope that 
m t = E q » 6 is close to the perturbed-posterior mean E Pt 9. The practical success of MFVB relies on 
the fact that this approximation is often good in practice. To derive interpretations of the individual 
terms in Eq. & we assume that this equality of means holds, but we indicate where we use this 
assumption with an approximation sign: nit 
are given in Appendix |A.2| 

dnit 
dt T 
dM t 
dt T ' 
dM t 


E Pt 9. The full derivations of the following equations 




= V t 


(7) 


( 8 ) 


dmj 


= VfH t 


(9) 


where E t is the covariance matrix of 9 under p t , V t is the covariance matrix of 9 under q*, and 


H t := E n 


d 2 logp t (9\x) 


ddd9 T 

Then substituting Eqs. 0. #, and ([9]) into Eq. ([6]), evaluating at t = 0, and writing H for Hq 
and V for Vo, we find 


dmt 

dt T 


E := 

dt 1 t =o 

E = VHt + V => 

E = (. I-VHyW 

Thus, we call E the LRVB estimate of the true posterior covariance E. 


( 10 ) 


3.1 Exactness of multivariate normal and SEM 

Consider approximating a multivariate normal posterior distribution p(9\x) with MFVB. This case 
arises, for instance, given a multivariate normal likelihood with fixed covariance S and an improper 
uniform prior on the mean parameter /j: 

P(x |m) = II 7V(x„| n,S) and q*{p)= Qj(Pj) 


^=1■.N 


j=l-J 


Here, A f represents the multivariate normal distribution, and the total dimension D of p is equal to 
the number of components J. So p, is a J-length vector for J > 1 with elements fi -\...., //./, and S is 
a known J x J positive definite matrix. Our variational approximation, q* , is given by the factorized 
distribution over mean components. 

In this case, it is well known that the MFVB posterior means are correct, but the marginal vari¬ 
ances are underestimated if S is not diagonal. This fact is often used to illustrate the shortcomings 

of mfvb gm. 

However, since the posterior means are correctly estimated, the LRVB approximation in Eq. ( p~0| ) 
is in fact an equality. That is, for the posterior location of a multivariate normal with known covari¬ 


ance, Eq. (|10b is not an approximation, and E = = E exactly. A detailed proof of this fact can 


be found in Appendix |B| 
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This result draws a connection between LRVB and the “supplemented expectation-maximization” 
(SEM) method of JH. SEM is an asymptotically exact covariance correction for the EM algorithm 
that transforms the full-data Fisher information matrix into the observed-data Fisher information 
matrix using a correction that is formally similar to Eq. ( [TO} . In this sense, SEM is a frequentist 
perspective on a special case of the LRVB correction when the amount of data goes to infinity. More 
details can be found in Appendix |B| 


4 Scaling 

Eq. ( [TO} requires the inverse of a matrix as large as all the unknown natural parameters in the pos¬ 
terior p(9 |x), which generally includes both main parameters and nuisance parameters. In many 
applications, the number of nuisance parameters may be very large. For example, in the finite mix¬ 
ture of normals model below (Section[6}, there is an indicator variable for the cluster assignment for 
each data point. If we treat these variables as nuisance parameters, the number of nuisance parame¬ 
ters grows with the number of data points N. As a result, directly computing the matrix inverse in 
Eq. ( [TO} may be impractical. 

However, since the variational covariance V is block diagonal and H is often sparse, one may 
be able to use Schur complements to efficiently calculate sub-matrices of E. Suppose that our full 
parameter space, 9, can be divided into a small number of variables of primary interest, called a, 
and a large (and possibly growing) number of nuisance variables, z: 



We can similarly define partitions for H and V. Assume also that we have the usual mean field 
factorization of the variational approximation: q*(a,z) = q*(a)q* (z), so that V az = 0. (The 
variational distributions may factor further as well.) We calculate the Schur complement of E in 
Eq. ( [TO} with respect to its 3th component to find that 

E q = (11) 

(I a - V a H a - V a H az (I z - VzHJ-'VzH^)- 1 V Q 


Here, /„ and I z ref er to a- and z-sized identity matrices, respectively. A detailed derivation can be 


found in Appendix A.2 In cases where ( I z — V z H z ) 1 can be efficiently calculated, Eq. 0 in¬ 


volves only an o-sizcd inverse. A finite mixture of Gaussians model, which we describe in Section[6] 
is one such case. 


5 Influence scores 

Influence scores are a powerful tool from classical linear regression that describe how much influ¬ 
ence a particular data point has on a modeled outcome. They can be used, for example, to identify 
outliers and investigate the robustness of the linear model 033 Q5). Analogously, it can be use¬ 
ful know how much Bayesian posterior means depend on the values of individual data points. A 
number of authors have proposed methods to measure the sensitivity of the posterior distribution to 
perturbations or deletions of data points both in linear models 1X61(17 1 and more generally II1814201 . 
LRVB gives a convenient formula to analytically calculate the influence of individual data points as 
covariances between the 9 vector and infinitesimal noise added to the data. 
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Consider the conditional expectation of a single parameter value, 9i , as a function of a single 
data point, x n . Specifically, for notational convenience, define 

TTlOi (*^n) |•t'1; **•) -tni %n\ 

One measure of the sensitivity of 9i to x n is the derivative of this function, mg i ( x n ) = m' g . ( x n ). 
We will refer to this derivative as an influence score for Bayesian models. 

To draw a connection between this influence score and covariances, imagine that our observa¬ 
tions, x, are in fact slightly noisy versions of the true data, x*. Specifically, our model becomes 

p(x\x* ,9) = p(x\x*)p(x*\9). 


In this new model, x* are unknown parameters, like 9. We assume our posterior beliefs about the 
true x* obey the following assumption^ 


E(x*|x) 
Cov (x*|x) 

S x 
0 ^s x 

Higher moments 


— 

:= lim - Yi x 
e-ro e 

< oo 

= O (e p ), for p > 2 


( 12 ) 


That is, the covariance matrix E x is proportional to e. Conditional on x, mg i (x*) is a random 
variable that varies as the posterior belief about x* varies around x,. By forming a Taylor expansion 
of m' g . (x* ) around x n we show for any data point x n and any parameter 9i that: 

m'g. (x n ) = lim -Cov (9 Z , x* lx) (13) 

1 e->o e 

(Appendix [5] contains further details.) That is, the limiting value of this covariance as e —> 0 
can be used to estimate the influence of observation x„ on the mixture parameters in the spirit of 
classical linear model influence scores from the statistics literature. 

Note that the covariances on the right hand side of Eq. ( fl3| are impossible to compute in naive 
MFVB, since they involve correlations between distinct mean field components, and difficult to 
compute using MCMC, since they require estimating a large number of very small covariances with 
a finite number of draws. However, LRVB leads to a straightforward analytic expression for these 
covariances. 

To derive the LRVB influence scores, we now assume that the parameter space of the posterior 
can be divided into three types of variables. We have main and nuisance parameters, called a and 
z respectively as in Section [4] and now also x*, the unobserved data. As before, we also assume 
that each has its own variational distribution, i.e. q*{9) = q*(a)q*(z)q*(x*). (The variational 
distributions may factor still further.) We can write: 


/ a \ 



2j ax * 


x* 

and E = 

V * 
t-'x a. 

£** 

5Vz 

V - / 


^za 




1 Note that if each observation x* has only one sufficient statistic, the perturbations can be treated as independent, and 
S x will be the identity. However, if each observation ic* has a vector of sufficient statistics, S x must take that structure into 
account. For example, if x n is drawn from a normal distribution centered at a:* , it will have sufficient statistics x n and x 
which will be correlated with one another. These correlations will cause S x to be different from the identity in general. 
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We use a similar partition for V and H. Recall that E x is the result of an infinitesimal pertur¬ 
bation and nearly zero, so we express our results in terms of S x in Eq. m Let Eq, denote the the 
ordinary LRVB covariance of a from Eq. m The covariance between a and the infinitesimally 
perturbed x then has the following formula: 

lim -E ax . = (14) 

e ->0 e 

E -\V a H ax * + V a H az (I z - V Z H Z )~ X V Z H ZX ,)S X 

This formula follows from the Schur inverse and taking e —> 0. Details can be found in Appendix[D] 

6 Experiments 

Mixture models constitute some of the most popular models for MFVB application JT] |2) and are 
often used as an example of where MFVB covariance estimates may go awry 0161. We thus 
illustrate the efficacy of LRVB on the problem of approximating the posterior when the likelihood 
is a finite mixture of multivariate Gaussians. 

6.1 Model 

We consider a A'-component mixture of P-dimensional multivariate normals with unknown com¬ 
ponent means, covariances, and weights. In what follows, the weight irk is the probability of the fcth 
component, A f denotes the multivariate normal distribution, Hk is the P-dimensional mean of the 
fcth component, and A/,. is the P x P precision matrix of the fcth component (so E/ r := A^ 1 is the 
covariance). N is the number of data points, and x n is the nth observed P-dimensional data point. 
We employ the standard trick of augmenting the data generating process with the latent indicator 
variables z n k, where n = 1,..., N and fc = 1,..., K, and 

P(Znk = 1) = TTfc 

Znk = 1 ^ ~ -V(/Xfc, A k ) 

The full likelihood under this augmentation is 

p(x\ir,n,A,z) = N(x n \^k 1 A k 1 ) Zrik (15) 

n=l:N k=l:K 

We assign independent variational factors to //., n. A, and The z variables are nuisance parame¬ 
ters. 

Our goal is to estimate the covariance matrix of the parameters log(7r),//, A in the posterior 
distribution p(n, ii. A|x) and to estimate the influence of each data point x n on the posterior means 
of log(7r),/x, A using LRVB (see Sections[3]and|5ji. 

In addition to the standard MFVB covariance matrices, we will compare the accuracy and 
speed of our estimates to Gibbs sampling on the augmented model (Eq. ( p~5] >) using the function 
rnmixGibbs from the R package bayesm. Our LRVB implementation relied heavily on linear 
algebra routines in RcppEigen SH|. We evaluate our results both on simulated data and on the 
MNIST data set fl2l . 

2 Unlike Section [ 3 TT] the variational posteriors for fi factor across components but not within components. That is, for 
each k, q*(^k) is a multivariate (not a univariate) normal distribution. 


7 



6.2 MNIST data set 


For a real-world example, we applied LRVB to the unsupervised classification of two digits from 
the MNIST dataset of handwritten digits. We first preprocess the MNIST dataset by performing 
principle component analysis on the training data’s centered pixel intensities and keeping the top 25 
components. For evaluation, the test data is projected onto the same 25-dimensional subspace found 
using the training data. 

We then treat the problem of separating handwritten Os from Is as an unsupervised clustering 
problem. We limit the dataset to instances labeled as 0 or 1, resulting in 12665 training and 2115 
test points. We fit the training data as a mixture of multivariate Gaussians. Here, K = 2, P = 25, 
and N = 12665. Then, keeping the //, A, and 7r parameters fixed, we calculate the expectations 
of the latent variables z in Eq. © for the test set. We assign test set data point x n to whichever 
component has maximum a posteriori expectation. We count successful classifications as test set 
points that match their cluster’s majority label and errors as test set points that are different from 
their cluster’s majority label. By this measure, our test set error rate was 0.08. We stress that we 
intend only to demonstrate the feasibility of LRVB on a large, real-world dataset rather than to 
propose practical methods for modeling MNIST. 


6.3 Covariance experiments 


In this section, we check the covariances estimated with Eq. (10 1 against a Gibbs sampler, which we 
treat as the ground truthj^] 

For simulations, we generated N = 10000 data points from K = 2 multivariate normal com¬ 
ponents in P = 2 dimensions. MFVB is expected to underestimate the marginal variance of fjb, 
A, and log(7r) when the components overlap since that induces correlation in the posteriors due to 
the uncertain classification of points between the clusters. These correlations are in violation of the 
MFVB assumption and cause the MFVB posterior variances to be mis-estimated. 

We performed 68 simulations, each of which had at least 500 effective Gibbs samples in each 
variable—calculated with the R tool ef fectiveSize from the coda package ll22ll . We note that 
for each of the parameters log(7r), /i, and A, both MH and MFVB produce posterior means close 
to the ground truth MCMC values, so our key assumption in the LRVB derivations of Section [3] 
appears to hold. 

Each point in Fig. |T| represents the a single parameter in a single simulation. For example, 
each point on the A graph represents the marginal standard deviation of a particular component of 
the A matrix for both the Gibbs sample and an alternative method. The first three graphs show the 
diagonal standard deviations, and the final graph shows the off-diagonal covariances. Note that the 
final graph excludes the MFVB estimates since most of the values are zero. 

Fig. Q shows that the raw MFVB covariance estimates are often quite different from the Gibbs 
sampler results, while the LRVB estimates match the Gibbs sampler closely. Although not shown, 
the results on the MNIST dataset were as good. 

In these simulations, on average LRVB took only 3.40 seconds, whereas the Gibbs sampler took 


306.97 seconds. We explore these timing tradeoffs in more detail in Section 6.4 


3 The likelihood described in Section |6T| is symmetric under relabeling. When the component locations and shapes have a 
real-life interpretation, the researcher is generally interested in the uncertainty of //. A, and ir for a particular labeling, not the 
marginal uncertainty over all possible re-labelings. This poses a problem for standard MCMC methods, and we restrict our 
simulations to regimes where label switching did not occur in our Gibbs sampler. The MFVB solution conveniently avoids 
this problem since the mean field assumption prevents it from representing more than one mode of the joint posterior. 
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Method: • LRVB • MFVB 



log(7u) 

Method: • LRVB • MFVB 



Gibbs std dev 


Method: • LRVB • MFVB 



All off-diagonal covariances 


Method: • LRVB 



Gibbs off-diagonal covariance 


Figure 1: Comparison of estimates of the posterior covariance matrix on simulation data for each 
model parameter from Gibbs, MFVB, and LRVB methods. In the simulations, N = 10000 (data 
points), K = 2 (components) and P = 2 (dimensions). 


6.4 Scaling experiments 

In this section we show that, for the finite mixture of multivariate Gaussians model, Eq. © scales 
linearly with N and polynomially in K and P. We also use simulations to experimentally compare 
the scaling of LRVB running times to Gibbs sampling estimates. We show that LRVB is much 
faster than Gibbs for the range of parameters we simulated, though Gibbs may be preferable for 
very high-dimensional problems. 

In the terms of Section [4] a includes the sufficient statistics from //, ^ r, and A, and grows as 
0(KP 2 ). The sufficient statistics for the variational posterior of // contain the P-length vectors /.if-, 
for each k, and the (P + l)P/2 second-order products in the covariance matrix iJk/iJ- Similarly, for 
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each k , the variational posterior of A involves the (P + 1 )/ :, /2 sufficient statistics in the symmetric 
matrix Afc as well as the term log | A*, |. The sufficient statistics for the posterior of tt j. are the K terms 
log 7Tfc j^This means that, minimally, Eq. (101 will require the inverse of a matrix of size 0(I\P 2 ). 

The sufficient statistics for z have dimension K x N. In other words, the number of nuisance pa¬ 
rameters grows with the number of data points, but H z = 0 for the multivariate normal (Appendix|C| 
contains further details), so we can apply Eq. (11 1 to replace the inverse of an 0(K iV)-sized matrix 
with multiplication by an O(KN )-sized matrix. Here, z conveniently corresponds directly to the z 
in Section |T] 

Since a matrix inverse is cubic in the size of the matrix, the worst-case scaling for LRVB is then 
0(K 2 ) in K, 0(P 4 5 6 ) in P and O(N) in N. 


In our simulations, shown in Fig. 0. we can see that, in practice, LRVB scales linearly in N 
and slightly less than quadratically in P, which is much better than the theoretical worst case. Note 
that the vertical axis, the time to run the algorithm, is on the log scale. At every value of P, I\, and 
N examined here, calculating LRVB is much faster than Gibbs sampling]^] 


6.5 Influence score experiments 

We now demonsttate the accuracy of the influence score formula, Eq. ( fl4| , on simulated data and 
on the MNIST dataset. We first compare Eq. (jT4|» to numeric derivatives. We then look at some 
patterns in the data that are made visible by having easy-to-calculate influence scores. 


6.5.1 Comparison with numeric derivatives 


In order verify that LRVB influence scores are correct for this model, we manually perturbed each 
component of the data and re-fit to find the new MFVB optimum. In other words, we numerically 
compute the derivative in Eq. (13 1 . 

Our simulation used N = 10000, P = 2, and K = 2. This was small enough to calculate the 
influence score for every data point in every dimension. To select sample data points for MNIST 
influence scores, we selected 5 representative data points with different ranges of E g » z n values so 
that some had their posterior probability concentrated in only one component, and some that were 
uncertainly classified between the two components. We then computed the LRVB influence scores 
from Eq. ( |T4| . For comparison, we again manually perturbed each dimension of each data point and 
re-optimized. 

On MNIST, not counting the time to compute the initial LRVB covariance, calculating the LRVB 
influence scores took 37 seconds, and the process of perturbing and re-optimizing took 20.7 minutes. 

The comparison between numeric differentiation and LRVB influence scores for both a sim¬ 
ulation (left) and MNIST (right) is shown in Fig. 0 - The influence scores obtained from the two 
approaches are practically indistinguishable. Though not shown, in both simulations and on MNIST, 
for each /i. A, and log(7r) parameter, the two methods were as similar to one another as in Fig. (|3j. 


4 Since XA=i ir n k — 1, using K sufficient statistics involves one redundant parameter. However, this does not 
violate any of the necessary assumptions for Eq. {To}. and it considerably simplifies the calculations. Note that though the 
perturbation argument of Section[3]requires the natural parameters of p(6\x) to be in the interior of the feasible space, it does 
not require that the natural parameters of p(x\6 ) be interior. 

5 For numeric stability we started the optimization procedures for MFVB at the true values, so the time to compute the 

optimum in our simulations was very fast and not representative of practice. On real data, the optimization time will depend 
on the quality of the starting point. Consequently, the times shown for LRVB are only the times to compute the LRVB 
estimate. The optimization times were on the same order. The Gibbs sampling time was linearly rescaled to the amount of 
time necessary to achieve 1000 effective samples in the slowest-mixing component of any parameter. 
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Method: LRVB Gibbs 
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LRVB 
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Figure 2: Scaling of LRVB and Gibbs on simulation data in both log and linear scales. Before taking 
logs, the line in the (N) graph is y cx x, and in the (P) graph, it is y oc x 2 . 


6.5.2 Influence score data 

One can see interesting patterns in the data with influence scores. Consider, for example, the sim¬ 
ulated data, which is depicted in Fig. (|4]) and has moderately overlapping components. The graph 
shows the effect on /in of perturbing the x n i (horizontal) coordinate of each datapoint. One can 
see that the component’s mean is essentially determined by the points that are assigned to it. Inter¬ 
estingly, points on the border between the two components reverse the sign of their effect. This is 
caused by changes in E ? * z n that more than counterbalance their effects on E g * /r. 

As can be seen in the simulated data of Fig. 0, the situation becomes more complex when the 
components overlap even more. Data points that are distant from a component center have nearly as 
much influence as data points well within the component. 
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Simulation \i MNIST A 




Numeric derivative Numeric derivative 


Figure 3: LRVB influence score accuracy. The simulated data uses all x n , and the MNIST data uses 
five representative x n . 


Finally, we consider influence scores in the MNIST data set. We selected 100 data points with 
a 0 or 1 label uniformly at random. In Fig. <0 we considered the influence of one particular 
dimension of each data point on one particular dimension of each component mean. Recall from 
Section 6.2 that each data point and component mean is 25-dimensional. Now we wish to derive 
a single influence score for the effect of each data point vector-valued x n on each vector-valued 
component mean . 

To that end, we define the influence of a data point x n on fit as the directional derivative of 
||/rfe||| with respect to x n . That is, we calculate the vector 9||/r,||2/<9.T n using LRVB as described 
above. Then we compute 


Influence of x n on /i^ 


max 

<5=PII=1 


( d\W\\t 

\ dxl 


The resulting influences are plotted in Fig. 0- Each point corresponds to a data point x n . Each 
sub-figure corresponds to a different component mean parameter. The horizontal axis value for point 
x n is the logit of E q *z n k (capped at ±15), which measures the posterior probability that x n came 
from component k. 

The two components show very different patterns. Component 0, the mode with mostly hand¬ 
written zeroes, has much higher influence amongst points that are classified within it than component 

1 . 


7 Conclusion 

The lack of accurate covariance estimates from the widely used mean-field variational Bayes (MFVB) 
methodology has been a longstanding shortcoming of MFVB. We have demonstrated that our method, 
linear response variational Bayes (LRVB), augments MFVB to deliver these covariance estimates 
in time that scales linearly with the number of data points. We have also shown how to use LRVB 
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to quickly calculate influence scores, a measure of the influence of each data point on posterior pa¬ 
rameter means. Our experiments have focused on mixtures of multivariate Gaussians since these 
have traditionally been used to illustrate the difficulties with MFVB covariance estimation. We hope 
that in future work our results can be extended to more complex models, including latent Dirichlet 
allocation and Bayesian nonparametric models, where MFVB has proven its practical success. 
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Figure 4: Influence scores for components with different amounts of overlap. Each graph shows 
the influence of x n \ on /zn, which is mean of the upper-right hand component. (X) indicates a 
component posterior mean. 
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Figure 5: Different influence patterns for the two Gaussians in the MNIST dataset. The 100 data 
points were chosen randomly. The vertical axis shows the maximum directional derivative of fip- 
with respect to changes in the data point, and the horizontal axis shows the (capped) logit posterior 
probability that the point came from that component. 
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Supplementary Material 

A Derivations 

A.l MFVB for conditional exponential families 

First, we require some notation for indexing 8. Recall that the MFVB assumption partitions the 
components of 9 into J groups according to the factorization 

j J 

9(0)=n «,(*,•)= 

j =i j=i 

We follow the common abuse of notation of defining the variational functions through their argu¬ 
ments by writing q(9j) for qj{9j). 

Each 9j is be a Dj -dimensional vector, with Dj = D , where the whole 9 vector has 

dimension D. Here and below, we will define the set [J] := {1, J}. 

Let R{9) denote a vector of length + 1) that is the vector of all possible products 

of the form IljefJ] 0?i,> where for each j, ij £ {0,1, Dj}. That is, R{9) is the vector of all 
possible products of terms of 9 where with at most one term from each 9j, and we define Ojq, := 1. 
Let Rj (9) denote the same vector, but excluding terms from 9j, and Rjji (9) denote the same vector, 
but excluding both 9j and 9j' . 

To aid in intuition, it will sometimes be useful to explicitly write the inner product of R(9) with 
a vector G as a sum of products of components of 9j. Let G be a |i?(0) |-length vector with element 
Gi, and let f?(0), denote the ith row of R{9). Then define 

|B(S)I 

G t R( 9) = Y GiR(9)i ■■= Y II (16) 

i —1 r£R jS[J] 

Here, we have “overloaded” the definition of R to express the sum over different products of 
9. We define r :) as the index of 9j in the row of R(0) corresponding to rj, G r as the element of G 
corresponding to that index set, and the sum over r £ R as the sum over all rows. 

We define the inner product of G with Rj ( 9) similarly, only with the result as a Dj -length vector. 
Specifically, define 


Y°r n 6 *ru 

rGRj ke[J]\j 

such that G r is a Dj -sized column vector. Linally, define 

Y G r n 

reR j:j > fceWMjd'} 

the same way, except where G r is a Dj x Dy matrix. 

This notation is intended to make it easy to express sums of products of elements of 9 in such 
a way that no two terms from a single 9j are multiplied together. The value of this notation will 
hopefully become clear in the following lemmas. 
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Lemma A.l. Suppose Eq. (|2| holds across all j; that is, 

P(9j\0ie[J]\j, x ) = ex P (vjOj - MVj))- 
Then the posterior p(9\x) can be written in the form 

log p{6\x) = E^Il e jr ,+c (17) 

reR j£[J] 

where the terms G r and C are constant in all 0^] 

Proof. We see that \ogp{9\x) = log p(9j\9 ie [j]\j, x) + logp(0j g [j]\j|a;) depends on 0 :] only via the 
first term in the sum. By Eq. 0- 

P{0j\ e ie[J]\j’ x ) = ex P (pJOj - A j{Vj)) 

It follows that logp(6*|x) is linear in the vector 9j. But this is true for all 6j, and the above form for 
p(9\x) follows. □ 

Lemma A.2. Suppose Eq. fj2j holds across all j. Then, for the natural parameter fjj, we have the 
following equations: 


Vj — G r 0 9kr k 
reRj fce[J]\j 

<91ogp(0|:c) 

Vj = ~~dOj 

dp / d 2 logp(6>|a:) \ 

'' dm T q * ^ d9d9 T J 


( 18 ) 


Here, each G r is an Dj-length vector that is constant with respect to 9. 


Proof. The first result follows by collecting the terms for the zth component of 9j in Eq. (17 i and 
applying Bayes’ theorem. The second is simply observing that differentiating with respect to 9j is a 
notationally tidy way to collect the j terms. 

For the third result, recall from Eq. (|4]i that 


Vj = Eg* [Vj] 


®Vj 

dmJ, 


E„ 


G r 0 9kr k 

rGRj k£[J]\j 


n 

reRj ke[J]\j 


Y G' r 0 m krk 
reRjj' ke[ J]\{j,j'} 


6 Strictly speaking, C is redundant since {0, ...,0} E R. Here and below, for additional clarity we will always write a 
constant. 
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= E 


q* 


E G 'r n ^r k 


= E 


9 * 


d\ogp(9\x) 

dOidej, 


Note that this proof relied on the fact that only one element of Of is in each product term, which 
allowed us to exchange the derivative with respect to the expectation with the expectation of the 
derivative with respect to 9y. 

□ 


A.2 Linear response 

We here derive the three equalities in Eqs. 0, ([8]), and 0, which appear respectively as three 
propositions below. In these propositions, we assume that p(9\x) is in the exponential family as 
above. We will further assume that all natural parameters (for p or variational approximations) are 
in the interior of the parameter space. The fact that the natural parameters are on the interior of the 
feasible space means that there exists an open ball around them that is also feasible. Let that ball 
have radius <5, and let t be within a 8 ball of the origin. Then p t (6 \x) is well defined for t in an open 
set containing zero. These assumptions will allow us to apply dominated convergence (cf. Section 

2.3 of 1251). 


Proposition A.3. ^E Pt 9 = E t . 

Proof. 


7 7 r t 

—pE Pt 6 = —y / 9e t e ~ c ^p(6\x)d6 by the definition of p t in Eq. 0 
dt 1 dt 1 J q 

r ~ d t 

= / 0 p(9\x)d0 by dominated convergence 

Je [dt J 

= f 00 T e tTf> -< t) p(6\x)d6 - f Qe tTe - c{t) p(e\x)d6 ■ 

Je Je 


dc{f) 

dt T 


= e P( [ee T ] — E Pt [o] E pt [ef = s t 


□ 


To approximate c jrpf, we assume not only that m t w E P( 0 for any particular t but further that 
rnt tracks the true mean E Pt 6 as t varies. In this case, by Proposition |A.3| we have 

dmt d 


dt T dt T 


E Pt 9 — E t , 


the first (approximate) equality in Eq. 0. 

To derive the final two equalities in Eqs. <[0) and 0, we make use of the following lemma. 
Lemma A.4. M t j depends on t only via the natural parameter of the q* ;) distribution. And 


dr )i 
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Proof. The first part of the lemma follows from writing the definition of M t j\ 

M t,j = = [ Oj ex P {PtjOj - Aj (tkj)) dO r 

J 0 j 


For the second part. 


dM t 




dr >L 


d <3 


-6j exp (jlfjOj — Aj ( rjtj )) d6j by dominated convergence 


= 0 ,- 


= 




exp (vtjOj - Aj (//,.,)) ddj 


a 


Proposition A.5. = V t . 


8 t T 


Proof. By Lemma A.4 we have for any indices i and j in [J] that 

dM t j _ dM t j dp t ,j 
dtf drift dtf ’ 


(19) 


where the first factor is also given by Lemma 


A.4 


It remains to find the second factor, . By the 

discussion after Eq. ([ 2 J) and the construction of p t , the natural parameter fj t ,j of pt (Oj |0je[j]\jj x ) 
satisfies 


Vt,j — VI G r @kr k + tj. 
reRj fee[J]\i 


So, as in the derivation of Eq. Q, the natural parameter p t j of q* ( 0 :l ) satisfies 


Vt,ji — ^ m t,krk A? 

reRj k£{J]\j 


( 20 ) 


for mt >r ■= Eq* v 0 r . 


Let dj be the dimension of 0 t and hence the dimension of q t j and tj. Hence, 


d V t,j 

dtf 


J = * 


0 d^di else 

where I a is the identity matrix o f dim ension a, and is the all zeros matrix of dimension a x b. 
Finally, by Eq. § , Lemma 

dM t 


and the expression for just obtained, we have 


dt T 


= V t I D = V t . 


□ 


Proposition A.6. = 

Proof. By Lemma A.4 and analogous to Eq. we have 

dM t j _ dM t j drjtj 
dm li dr ifj dm li ’ 

The result follows immediately from Lemma [A~4| 


( 21 ) 

□ 
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B Multivariate normal posteriors and SEM 


For any target distribution p(9\x), it is well-known that MFVB cannot be used to estimate the co- 
variances between the components of 9. In particular, if q* is the estimate of p(0\x) returned by 
MFVB, q* will have a block-diagonal covariance matrix—no matter the form of the covariance of 
p{9\x). By contrast, the next result shows that the LRVB covariance estimate is exactly correct in 
the case where the target distribution, p(9\x), is (multivariate) normal. 

In order to prove this result, we will rely on the following lemma. 

Lemma B.l. Consider a target posterior distribution characterized by p(9\x) = Af(9\p, E), where 
p and E may depend on x, and E is invertible. Let 9 = {9\,... ,9j), and consider a MFVB 
approximation to p{9\x) that factorizes as q{9) = . q(9j). Then the variational posterior means 

are the true posterior means; i.e. nij = Pj for all j between 1 and J. 

Proof The derivation of MFVB for the multivariate normal can be found in Section 10.1.2 of 0; 
we highlight some key results here. Let A = E -1 . Let the j index on a row or column correspond 
to 9j , and let the —j index correspond to {9; : i £ [J] \ j}. E.g., for j = 1, 


An Ai,_i 

A-i.i A_i_i 


By the assumption that p(9\x) = M{9\p 1 E), we have 
1' ogp{9 j \9 ie[J] \ j ,x) = - /(,) 7 A- Pj) + (9j - /(,.} 7 -V f0 j - p_ 3 ) + C, (22) 

where the final term is constant with respect to 0 ;i . It follows that 


log q*j{0j) = ^qT:ie[J]\j log p(0,x) + C 

= —~z9j hjj9j + djPjAjj — 9jAj ^j(E q *9-j — p~j). 


So 


with mean parameters 


q*(e j )=Af(e j \m j ,A T 1 ), 


\j — E q *6j — pj A- . A P-j) 


(23) 


as well as an equation for E q »9 T 9. 

Note that A n must be invertible, for if it were not, E would not be invertible. 

The solution m = p is a unique stable point for Eq. ( [23] ), since the fixed point equations for each 
j can be stacked and rearranged to give 


m — p 


0 A 11 1 Ai2 ••• A 11 1 A i (j_i) A 11 1 A 1 j 

. AjjAji AjjAj2 ■ ■ ■ A J jAj(j_i) 0 

' A~ 1 • ■ ■ 0 • • • 0 " 


(to - p) 


0 

0 


0 


0 A i2 


0 


Ajl Aj2 


A 


-l 

jj J 


Ai(j-i) Au 

Aj(j-i) 0 


(m — p) <t=> 
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0 = 


A 


11 


0 


0 0 
o a 12 


(m — p) 


■■ A jj 

Ai(j-i) A u 


Aji A j 2 • • • A j(j-i) 

0 = A (m — p) -£=> 

TO = (i. 

The last step follows from the assumption that E (and hence A) is invertible. It follows that /.i is the 

□ 


0 


(to- — p) -£=> 


unique stable point of Eq. (23 i. 


Proposition B.2. Assume we are in the setting of Lemma B.l where additionally // and E are on 
the interior of the feasible parameter space. Then the LRVB covariance estimate exactly captures 
the true covariance, E = E. 

Proof Consider the perturbation for LRVB defined in Eq. 0- By perturbing the log likelihood, 
we change both the true means p t and the variational solutions, to*. The result is a valid density 


function since the original p and E are on the interior of the parameter space. By Lemma B.l the 
MFVB solutions are exactly the true means, so m t ,j = pt.j, and the derivatives are the same as well. 
This means that the first term in Eq. ( fTO} is not approximate, i.e. 

dm t _ jLih- a - v 

dt T dt TEpt ° St ’ 

It follows from the arguments in Appendix [Bjthat the LRVB covariance matrix is exact, and E = E. 

□ 


B.l Comparison with supplemented expectation-maximization 

This result about the multivariate normal distribution draws a connection between LRVB corrections 
and the “supplemented expectation-maximization” (SEM) method of H3l . SEM is an asymptoti¬ 
cally exact covariance correction for the EM algorithm that transforms the full-data Fisher infor¬ 
mation matrix into the observed-data Fisher information matrix using a correction that is formally 
similar to Eq. ( fT0| >. In this section, we argue that this similarity is not a coincidence; in fact the SEM 
correction is an asymptotic version of LRVB with two variational blocks, one for the missing data 
and one for the unknown parameters. 

Although LRVB as described here requires a prior (unlike SEM, which supplements the MLE), 
the two covariance corrections coincide when the full information likelihood is approximately log 
quadratic and proportional to the posterior, p(6\x). This might be expected to occur when we have 
a large number of independent data points informing each parameter—i.e., when a central limit 
theorem applies and the priors do not affect the posterior. In the full information likelihood, some 
terms may be viewed as missing data, whereas in the Bayesian model the same terms may be viewed 
as latent parameters, but this does not prevent us from formally comparing the two methods. 
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We can draw a term-by-term analogy with the equations in fl3l . We denote variables from the 
SEM paper with a superscript “SEM” to avoid confusion. MFVB does not differentiate between 
missing data and parameters to be estimated, so our 9 corresponds to (0 SEM , Y^ EM ) in ED- SEM 
is an asymptotic theory, so we may assume that (Q SEM ; Y^ EM ) have a multivariate normal distri¬ 
bution, and that we are interested in the mean and covariance of Q SEM . 

In the E-step of fm we replace Y^ EM with its conditional expectation given the data and other 
qSem' q-jjjg corresponds precisely to Eq. (23 i, taking 0 :l = Y rq EM . In the M-step, we find the 
maximum of the log likelihood with respect to Q SEM , keeping Y^ EM fixed at its expectation. Since 
the mode of a multivariate normal distribution is also its mean, this, too, corresponds to Eq. 
now taking 9j = 0 SEM . 

It follows that the MFVB and EM fixed point equations are the same; i.e., our M is the same as 
their M sem , and our dM/dm of Eq. (21 1 corresponds to the transpose of their DM sem , defined 
in Eq. (2.2.1) of lU3l . Since the “complete information” corresponds to the variance of () SEM with 
fixed values for Yq gg 1 , this is the same as our E f/ - 11 , the variational covariance, whose inverse is 


I oc }. Taken all together, this means that equation (2.4.6) of ED can be re-written as our Eq. (pLO)). 


ySEM =/ -l (j _ DM SEM ) 


-i 

-i 


-i 




C Multivariate normal mixture details 


In this section we derive the basic formulas needed to calculate Eq. ([TO]) and Eq. (14 1 for a finite 
mixture of normals, which is the model used in Section[6] We will follow the notation introduced in 
Section [6J] 

Let each observation, x n , be a P x 1 vector. We will denote the Pth component of the nth 
observation x n , with a similar pattern for z and /t. We will denote the p, qth entry in the matrix A/ ;: 
as A k.f>q ■ The data generating process is as follows: 


N 


logP (x n \z n ,p,A) = y ^Z nk log (t>k{x n )+C 


n—1 

1 T 1 

log <j>k{x) = -~(x-Hk) A k (x- n k ) + — log |A fc | +C 

K 


log P(z n k\Ek) = y~] ~nk log E k + C 

k —1 

N K f 1 T i \ 

log P(z, /X, 7 T, A|x) = log7T fc - - (x n - Hk) T Ak (x n - Mfc) + 2 lo S l Afc l ) + C 

n—l fc=l ' ' 

In all our results we simply used improper, flat priors, though it would be trivial to incorporate 
conjugate priors. 

From the assumptions in Eq. (12 1 , the posterior expectation of x* will always have \'l, q x* = x, 
so for notational convenience we can simply drop the * and apply the LRVB formulas as if x were 
a random parameter. However, it is useful to remember that the parameter x is different from the 
variable we condition on - we are actually estimating p(a, z, x*\x). 
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The parameters /z*;, Afc, n, and z n will each be given their own variational distribution. By 
standard results, the variational distributions will be: 


q /lk = Multivariate Normal 

QA k = Wishart 

q~ = Dirichlet 

q Zri = Multinoulli (one multinomial draw) 
q x * = Multivariate Normal 


The sufficient statistics for jik are all terms of the form Hkp and Hk P l>kr r Consequently, the 
sub-vector of 9 corresponding to fik is 


( ftfci N ' 
/kkp 

MfclMfcl 

MfclMfc2 

\ fifcPMfcP / 


We will only save one copy of Hkp^kq and Hkq^kp, so 9 Mic has length P + ^ (P + 1) P. For all 
the parameters, we denote the complete stacked vector without a k subscript: 


/ o IM 
\ 


The sufficient statistics for x* are analogous to those for //. 

The sufficient statistics for A/ c are all the terms A k, pq and the term log |A*|. Again, since A is 
symmetric, we do not keep redundant terms, so 0\ k has length 1 + ^ (P + 1) P. 

The sufficient statistics for 7r is the A'-vector (log7Ti, logi^x)- 
The sufficient statistics for z is simply the N x K values z n k themselves. 

In terms of Section[5] we have 


\ 9-k ) 
z = ( e z ) 

X = (9 X ) 


That is, we are primarily interested in the covariance of the sufficient statistics of /i, A, and 7r, z 
are nuisance parameters, and x* is the “unobserved” data. 

To put the log likelihood in terms useful for LRVB, we must express it in terms of the sufficient 
statistics, taking into account the fact the 9 vector does not store redundant terms (e.g. it will only 
keep A nh for a < b since A is symmetric). 


1 

2 


(Xn 


Afc {X r , 


Hk) 


--trace [A k (x n - (i k ) ( x n 
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- - 2 EE (A-k,ab (%n,a f-^k,a) (%n,b 

a b 

^ -aEE (A.k,abfJ'k,afJ'k,b A.k,ab%n,a/J j k,b ^-k,ab^n,bf^k,a T h-k,ab%n,a%n,b) 


a b 


cy ^ ^ ^-k,aa (/^fe) “1“ ^ ^ A-k,aa%n,alJ J k,a ~ ^ ^ A k,aa ( 


2 \ 2 


2 X! ^k,ab^k,a^k,b + ^2 A-k,ab'En,aH i k,b ^ ^ ^ ^-k,ab^'n,a^n,b 
a^b a^b a^b 


cy y ^ A.k,aa (/^fe) T" ^ ^ A-k,aa%n,afJ'k,a cy ^ ^ ^-k,aa ( 


£rJ - 


E Afc,af>/ifc,aMfc,& + X! ( X n,a^k,b + Xn tb ^k,a) ^ A-k,ab%n,a%n,b 


a<b 


a<.b 


a<b 


The MFVB updates and covariances in V are all given by properties of standard distributions. 
To compute the LRVB corrections, it only remains to calculate the hessian of H. These terms can 
be read directly off the posterior. First we calculate derivatives with respect to components of fi . 


d 2 H 


^f^k,a^-^-k,ab 

d 2 H 

& (/J'k,afJ'k,b) dA.k ,ab 

d 2 H 

d[kk,adZnk 

d 2 H 

d (fik,a^k,b) 9z nk 

All other // derivatives are zero. For A, 


= E 


Znk%n,b 


1 (a=b) 


E 


Znk 


y ^ ^-k,ab%n,b 
b 

Y \ 

2 


A, 


k.ab 


d 2 H 


dA.k,abdZnk 

d 2 H 

a log | Afe I dz nk 


l(a=6) 

2 


( X n ^ a X nb l^k,aXn,b fJ'k,bXn,a + Mfe ,a l^k,b) 


The remaining A derivatives are zero. The only nonzero second derivatives for log tt are to Z 
and are given by 


d 2 H 


= 1 


5 log 7 TjdZnk 

To calculate the influence scores, we additionally need second derivatives involving x. 


d 2 H 


dx n ^ a d^ K b 


— ^nk^k.ab 
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^nkf^k,b 


d 2 H 

d%n,adA-k,ab 

d 2 H 

d%n,adZnk 


b 


k,ab 


d 2 H 


& (Xn,aXn,b) ^^k,ab 

d 2 H 

& (Xn, a Xn,b) @Z n k 


Znk | 2 


= -A 


k,ab 


1 (a=b) 


1 (a=b) 


All other second derivatives involving x are zero. Note in particular that H z 
efficient calculation of Eq. (111. 


0, allowing 


D Influence scores 

D.l Influence score derivations 

In this section, we derive the formulas in Section [5] We will follow the notation there defined. 
Consider the “influence score” given by the derivative of the conditional expectation: 


me i {x n ) = E p [6 i \x 1 ,...,x n ,...,x N ] 

-f—nig ; ( x n ) := trig. ( x n ) 

CLXn 

A Taylor expansion of mg i (4) around x n gives 

mg i 04 ) = met ( x n ) + m'g. (x n ) ( x* n - x n ) + 

O (« - Znf) 

Multiplying both sides by (4 - x n ) and taking expectations conditional on x gives 


E [(mg, ( 4 ) - mSi ( Xn )) (4 ~ x n) \x ] 


= mg.(9i,x n )E {x* n -x n ) 2 \x +0^(4-ain) 3 ) 


On the left side. 


E [(mg i (4) - {x n )) (4 - x n ) \x] 

= E [E [(mg i (4) - m 6i (*„)) « - x n ) |4] |x] 

= E [E [(E [Oi\x i, .,4, .,xjv] - me; (ar„)) (4 - *n) 14] M 
= E [E [(0* - m 8i (x n )) (4 - a;„) 14] M 
= E [(Oi - m 6i (x n )) (4 - x n ) |at] 

= Cov (6» is 4|a:) 


On the right side. 


(4 *«) E (4 - x n ) 2 \x +0 ((4 - a; n ) 3 ^ 
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(6i,x n )e +O (e^ 


So that 


m' 0i (6i,x n ) = ^Cov (Oi,x* n \x) + O j 

...which is Eq. ( fi~3j ). 

We now assume that our parameter space can be divided into types of variables: a and z as 
before, and x, the perturbed data. As before, we also assume that each has its own variational 
distribution. 



As before, we use a similar partition for V and H. Specifically, 


V a 

0 

0 


0 

v x * 

0 


0 

0 

4 


H a 

XI ax* 

H az 

H x * c 

H x * 

H x * z 

H za 

H zx * 

H z 


We are interested in , the covariance between a and x, which can be interpreted as influence 
scores. 

The matrix T, x * is the result of an infinitesimal perturbation, and so will be nearly zero. We will 
write 


= eS :r 


Note that S x * is not necessarily diagonal if the variational distribution for each datapoint x n is 
multidimensional. Applying formula Eq. CD to eliminate z, we have 



( I a - V a H a 

\ tS x * H x *q 



V a H az 
cS x * H x * z 


Qz ( V z H za V Z H ZX , 



{ I a - V a H a -V a H ax , V a H az Q z V z H za V a H az Q z V z H zx . \ 

^ -eS x ,H x * q I x . J { eS x *H x * z Q z V z H za eS x *H x * z Q z V z H za J 

4 -V a H a - V a H az Q z V z H za - (V a H ax , + V a H nz Q z V z H zx ,) 1 _1 / 
-e (S x *H x * q + S x *H x * z Q z V z H za ) I x * - eS x *H x * z Q z V z H za _ ^ 

4 - V a H a - V a H az Q z V z H za -Q ax , V 1 fy a 0 \ 

xQx*a 4* X-Qx* \ 0 xS x * J 


v a 0 

0 eS x * 



4 Q 0 
0 eS x * 
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In the last step we have defined a some placeholder matrices called Q to simplify subsequent 
expressions: 


Qz 


Qo 

Qx 


Qx* 


(Iz - V.Hz )- 1 

V a H ax * + V a H az Q z V z H zx - 
S x -H xq + S X ,H X , Z Q Z V Z H Z 
S x -H x * z Q z V z H za 


This may appear to be a complicated expression, but it can be considerably simplified by using 
the fact that £ ax * ex e and e ~ 0, which allows us to eliminate all e terms that are second-order 
or higher. We can also use the Taylor expansion of the matrix inverse that gives, for e small, and 
invertible matrix A, 


(I-eB ) -1 = I + eB + 0(e 2 ) 

Again applying a Schur complement, we can write the expression for the upper-left corner: 

Sq, — (^a VaH a VaH az Q z V z H za tQax* (J-x* ^Qx*} Qx*a ^ 

= (J Q - V a H a - V Q H az Q z V z H za )- 1 14 + 0 (e) 


Note that as e —> 0, this gives the ordinary LRVB estimate for £ a , as expected. Infinitesimal 
perturbations to our data do not change our beliefs about the posterior covariance. Next, the Schur 
complement formula for the upper right corner gives 


£ 


ax* 


eS" 1 (Q ax * (I x * - eQx*) -1 ) S x * 

1 ( Qax * {jx* + tQx* + O (e"))) S x * 

£ £a 1 Qax* S x * + O (e 2 ) 

e£" 1 (y a H ax * + 14 H az (J 2 - VzHz)- 1 V Z H ZX .) S x . + O (e 2 ) 


Taking limits gives Eq. 


(14 1. 
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